Ford GoBike System Data Exploration
This data set includes information about individual rides made in a bike-sharing system covering the greater San Francisco Bay area. It contains 16 features and well over 150,000 observations.
The first step of any data analysis project is importing the packages and the dataset to be analyzed.
# Import packages
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
# Import Dataset "201902-fordgobike-tripdata.csv" and load into DataFrame
df = pd.read_csv("201902-fordgobike-tripdata.csv")
Next, let us inspect the dataset to see if it needs any wrangling.
Check for missing values
df.isnull().sum()
duration_sec                  0
start_time                    0
end_time                      0
start_station_id            197
start_station_name          197
start_station_latitude        0
start_station_longitude       0
end_station_id              197
end_station_name            197
end_station_latitude          0
end_station_longitude         0
bike_id                       0
user_type                     0
member_birth_year          8265
member_gender              8265
bike_share_for_all_trip       0
dtype: int64
Now, let us drop the rows with missing values
df.dropna(inplace=True)
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 174952 entries, 0 to 183411
Data columns (total 16 columns):
 #   Column                   Non-Null Count   Dtype
---  ------                   --------------   -----
 0   duration_sec             174952 non-null  int64
 1   start_time               174952 non-null  object
 2   end_time                 174952 non-null  object
 3   start_station_id         174952 non-null  float64
 4   start_station_name       174952 non-null  object
 5   start_station_latitude   174952 non-null  float64
 6   start_station_longitude  174952 non-null  float64
 7   end_station_id           174952 non-null  float64
 8   end_station_name         174952 non-null  object
 9   end_station_latitude     174952 non-null  float64
 10  end_station_longitude    174952 non-null  float64
 11  bike_id                  174952 non-null  int64
 12  user_type                174952 non-null  object
 13  member_birth_year        174952 non-null  float64
 14  member_gender            174952 non-null  object
 15  bike_share_for_all_trip  174952 non-null  object
dtypes: float64(7), int64(2), object(7)
memory usage: 22.7+ MB
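As a side note on the dropna step above: by default it discards a row if any column is missing, and most of the missing values here come from member_birth_year. When birth year is not needed, a subset argument would keep more rows. A minimal sketch on toy data (the column names mirror the real dataset, but the values are made up):

```python
import pandas as pd
import numpy as np

# Toy frame mimicking the missing-value pattern in the real data
toy = pd.DataFrame({
    "duration_sec": [600, 300, 450, 900],
    "start_station_id": [10.0, np.nan, 12.0, 13.0],
    "member_birth_year": [1990.0, 1985.0, np.nan, np.nan],
})

# Default: drop a row if ANY column is missing
all_complete = toy.dropna()

# Alternative: only require the station id, keeping rows
# that merely lack a birth year
station_complete = toy.dropna(subset=["start_station_id"])

print(len(all_complete), len(station_complete))  # → 1 3
```

Here the subset variant keeps two extra rows; on the real data the trade-off is similar whenever a demographic column is not part of the analysis.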
Some columns have the wrong data types, so let us convert them to more appropriate ones.
# Change dtype from 'float' to 'int'
cols = ["start_station_id","end_station_id","member_birth_year"]
for col in cols:
    df[col] = df[col].astype(int)
# Check
df[cols].info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 174952 entries, 0 to 183411
Data columns (total 3 columns):
 #   Column             Non-Null Count   Dtype
---  ------             --------------   -----
 0   start_station_id   174952 non-null  int32
 1   end_station_id     174952 non-null  int32
 2   member_birth_year  174952 non-null  int32
dtypes: int32(3)
memory usage: 3.3 MB
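As a side note, astype(int) only works here because the missing values were dropped first; on a column containing NaN it raises an error. pandas' nullable Int64 dtype is an alternative that tolerates missing values, which would matter if we had chosen to keep the incomplete rows. A sketch:

```python
import pandas as pd
import numpy as np

# A float column with a missing value, like start_station_id before dropna
s = pd.Series([10.0, np.nan, 12.0])

# s.astype(int) would raise here; the nullable extension dtype
# keeps the missing value while making the rest integers
nullable = s.astype("Int64")
print(nullable.dtype)  # → Int64
```

The capital-I "Int64" is pandas' nullable extension type, distinct from NumPy's int64.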
Notice that the columns "start_time" and "end_time" contain datetime values but are stored as the generic object dtype.
Let us convert them to proper datetime objects.
cols = ["start_time","end_time"]
for col in cols:
    df[col] = pd.to_datetime(df[col])
# Check
df[cols].info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 174952 entries, 0 to 183411
Data columns (total 2 columns):
 #   Column      Non-Null Count   Dtype
---  ------      --------------   -----
 0   start_time  174952 non-null  datetime64[ns]
 1   end_time    174952 non-null  datetime64[ns]
dtypes: datetime64[ns](2)
memory usage: 4.0 MB
df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 174952 entries, 0 to 183411
Data columns (total 16 columns):
 #   Column                   Non-Null Count   Dtype
---  ------                   --------------   -----
 0   duration_sec             174952 non-null  int64
 1   start_time               174952 non-null  datetime64[ns]
 2   end_time                 174952 non-null  datetime64[ns]
 3   start_station_id         174952 non-null  int32
 4   start_station_name       174952 non-null  object
 5   start_station_latitude   174952 non-null  float64
 6   start_station_longitude  174952 non-null  float64
 7   end_station_id           174952 non-null  int32
 8   end_station_name         174952 non-null  object
 9   end_station_latitude     174952 non-null  float64
 10  end_station_longitude    174952 non-null  float64
 11  bike_id                  174952 non-null  int64
 12  user_type                174952 non-null  object
 13  member_birth_year        174952 non-null  int32
 14  member_gender            174952 non-null  object
 15  bike_share_for_all_trip  174952 non-null  object
dtypes: datetime64[ns](2), float64(4), int32(3), int64(2), object(5)
memory usage: 20.7+ MB
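As a side note, the timestamp conversion could also be done at load time: read_csv accepts a parse_dates argument, which skips the separate to_datetime loop. A sketch using a small in-memory CSV (hypothetical rows, not the real trip data):

```python
import io
import pandas as pd

# Two made-up rows mimicking the trip data's timestamp columns
csv = io.StringIO(
    "duration_sec,start_time,end_time\n"
    "600,2019-02-01 08:00:00,2019-02-01 08:10:00\n"
    "300,2019-02-02 09:00:00,2019-02-02 09:05:00\n"
)

# parse_dates converts the listed columns while reading
df_small = pd.read_csv(csv, parse_dates=["start_time", "end_time"])
print(df_small.dtypes)
```

Both columns come back as datetime64[ns] directly, so the post-load conversion step becomes unnecessary.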
Let us see the distribution of the observations in the "user_type" variable
fig,ax = plt.subplots(figsize=(10,6))
color = sns.color_palette()[0]
sns.countplot(data = df, x = "user_type",color= color)
plt.xlabel("User Type")
plt.ylabel("Frequency [Count]")
plt.title("Distribution of User Types");
print(df.user_type.value_counts(normalize=True))
Subscriber    0.905311
Customer      0.094689
Name: user_type, dtype: float64
We can see that the distribution of user types in our dataset is heavily imbalanced, with over 90% of rides made by Subscribers.
Next, let us look at the distribution of the "duration_sec" variable using a histogram
# plot 'duration_sec' histogram
bin_range = np.arange(0,df.duration_sec.max()+200,10)
plt.hist(data= df,x="duration_sec",bins=bin_range)
plt.xlabel("Time [sec]")
plt.ylabel("Frequency [count]")
plt.title("Distribution of Time [Sec]");
We can see from the visualization that the duration_sec variable is highly right-skewed, but let us go further and trim away the bottom and top 10% of values to see if we can get a better view of the distribution.
# Get the 10th and 90th percentiles
low,high = df.duration_sec.quantile([0.1,0.9])
# Extract mask of values between "low" and "high" and plot
mask = df.duration_sec.between(low,high)
df_mask = df[mask]
sns.displot(df_mask["duration_sec"])
plt.xlabel("Time [sec]")
plt.ylabel("Frequency [count]")
plt.title("Distribution of Trimmed Time [Sec]");
Now we can clearly see the right-skewed distribution of the duration_sec variable.
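An alternative to trimming is a log-scaled x-axis, which often makes a right-skewed duration distribution readable without discarding any observations. A sketch on synthetic skewed values (not the real trips):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # render off-screen
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
# Synthetic right-skewed durations standing in for duration_sec
durations = rng.lognormal(mean=6.5, sigma=0.8, size=1000)

# Log-spaced bins so every bar spans an equal multiplicative range
bins = np.logspace(np.log10(durations.min()),
                   np.log10(durations.max()), 40)
plt.hist(durations, bins=bins)
plt.xscale("log")
plt.xlabel("Time [sec] (log scale)")
plt.title("Log-scaled duration histogram");
```

Applied to df.duration_sec, the same pattern would show the full range of trips, from a minute to several hours, on one readable axis.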
Next, let us look at the proportion (distribution) of the genders in our DataFrame
fig,ax = plt.subplots(figsize=(10,6))
df.member_gender.value_counts().plot.pie()
plt.xlabel("User Gender")
plt.ylabel("Frequency [Count]")
plt.title("Distribution of User Gender");
print(df.member_gender.value_counts(normalize=True))
Male      0.745919
Female    0.233235
Other     0.020846
Name: member_gender, dtype: float64
We can see that almost 75% of our users are Male, and only about 2% of our users identify as neither Male nor Female.
Let us explore the distribution of member_birth_year
fig,ax = plt.subplots(figsize=(20,6))
color = sns.color_palette()[0]
sns.countplot(data = df, x="member_birth_year",color = color)
plt.xticks(rotation=90)
plt.title("Distribution of Birth Years");
We can see that the bulk of the members in our DataFrame were born between 1982 and 1996.
fig,ax = plt.subplots(figsize=(10,6))
sns.countplot(data = df, x= "bike_share_for_all_trip",color = sns.color_palette()[0])
plt.ylabel("Frequency [count]")
plt.xlabel("Bike shared for all trip?")
plt.title("Distribution of bike_share_for_all_trip")
print(df.bike_share_for_all_trip.value_counts(normalize=True))
No     0.900853
Yes    0.099147
Name: bike_share_for_all_trip, dtype: float64
We can see that only about 10 percent of the trips in our dataset were taken under the Bike Share for All option.
Even though start_station_id is numerical, it really encodes categorical data, so let us use a countplot to check its distribution.
# Plot a countplot of "start_station_id"
fig,ax = plt.subplots(figsize=(20,6))
sns.countplot(data= df,x="start_station_id",color = sns.color_palette()[0])
plt.xticks([])
plt.title("Distribution of Station IDs");
print(f"Number of Unique Station IDs : {df.start_station_id.nunique()}")
Number of Unique Station IDs : 329
Our plot is difficult to interpret because it has too many categories (329 in total).
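One common way around this is to plot only the busiest stations. A sketch on a toy id column (the real data has 329 unique ids; the values here are made up):

```python
import pandas as pd

# Toy station id column standing in for df.start_station_id
ids = pd.Series([1, 1, 1, 2, 2, 3, 4, 4, 4, 4],
                name="start_station_id")

# Keep only the N most frequent stations instead of all of them
top = ids.value_counts().nlargest(2)
print(top)
```

Calling top.plot.bar() then gives a readable chart; on the real column, nlargest(20) would show the twenty busiest stations instead of all 329.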
Now let us look at the relationship between Variable(s) in our DataFrame.
We can start by visualizing the relationship between variables member_gender and user_type
# plot side-by-side countplot of 'member_gender' and 'user_type'
fig,ax= plt.subplots(figsize=(10,6))
order =df.member_gender.value_counts().index
ax = sns.countplot(data=df, x="member_gender", hue="user_type", order=order)
While the plot suggests that a higher proportion of Males are Subscribers than in the other two gender categories, it would be better to look at the actual numbers.
df.user_type.groupby([df.member_gender]).value_counts(normalize=True).to_frame()
| member_gender | user_type | proportion |
|---|---|---|
| Female | Subscriber | 0.886705 |
| Female | Customer | 0.113295 |
| Male | Subscriber | 0.911931 |
| Male | Customer | 0.088069 |
| Other | Subscriber | 0.876611 |
| Other | Customer | 0.123389 |
We can see that indeed about 91% of Male users are Subscribers, compared with about 89% and 88% for Female and Other respectively.
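The same row-normalised proportions can also be produced with pd.crosstab, which returns a tidy two-dimensional table rather than a stacked MultiIndex. A sketch on toy values (not the real counts):

```python
import pandas as pd

# Toy columns mimicking member_gender and user_type
toy = pd.DataFrame({
    "member_gender": ["Male", "Male", "Female",
                      "Female", "Male", "Other"],
    "user_type": ["Subscriber", "Customer", "Subscriber",
                  "Subscriber", "Subscriber", "Customer"],
})

# normalize="index" makes each row sum to 1: the share of each
# user_type within each gender
table = pd.crosstab(toy["member_gender"], toy["user_type"],
                    normalize="index")
print(table)
```

On the real data, pd.crosstab(df.member_gender, df.user_type, normalize="index") would reproduce the figures above in one readable grid.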
For this exploration, we will do some feature engineering to extract the feature day_of_week from the variable start_time
# create feature "day_of_week"
df["day_of_week"] = df["start_time"].dt.dayofweek
Let us change the encoding of day_of_week from numeric codes to weekday names
# Create dw_mapping dict
dw_mapping = {
    0: 'Monday',
    1: 'Tuesday',
    2: 'Wednesday',
    3: 'Thursday',
    4: 'Friday',
    5: 'Saturday',
    6: 'Sunday'
}
# Change 'day_of_week' encoding
df['day_of_week']=df['day_of_week'].map(dw_mapping)
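As a side note, the mapping dict can be skipped entirely: the dt.day_name() accessor returns weekday names directly from a datetime column. A sketch on two hypothetical dates:

```python
import pandas as pd

# A Monday and a Saturday from February 2019
ts = pd.Series(pd.to_datetime(["2019-02-04", "2019-02-09"]))

# dt.day_name() yields the weekday name, no mapping dict needed
names = ts.dt.day_name()
print(list(names))  # → ['Monday', 'Saturday']
```

So df["start_time"].dt.day_name() would produce the same day_of_week column in one step.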
Now, let us conduct a bivariate exploration of the variables day_of_week and duration_sec to determine on which day of the week riders spent the most time.
# Aggregate "duration_sec" by its mean per "day_of_week"
day_average = df.duration_sec.groupby(df.day_of_week).mean()
day_average
day_of_week
Friday       681.274699
Monday       696.970906
Saturday     849.771750
Sunday       858.673649
Thursday     678.289748
Tuesday      652.839459
Wednesday    665.127946
Name: duration_sec, dtype: float64
Since day of the week is ordinal data, let us rearrange the index accordingly.
day_order = ["Monday","Tuesday","Wednesday","Thursday","Friday","Saturday","Sunday"]
day_average = day_average.reindex(day_order)
day_average
day_of_week
Monday       696.970906
Tuesday      652.839459
Wednesday    665.127946
Thursday     678.289748
Friday       681.274699
Saturday     849.771750
Sunday       858.673649
Name: duration_sec, dtype: float64
fig,ax = plt.subplots(figsize=(10,6))
color = ["lightblue","lightblue","lightblue","lightblue","lightblue","lightblue","orange"]
plt.bar(day_average.index, day_average.values,color=color)
plt.xlabel("Day of the week")
plt.ylabel("Average Duration [Sec]")
plt.title("Average Duration [sec] by day");
We can see that bike riders in our dataset spent the most time riding, on average, on Sundays.
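The hardcoded color list above works only because Sunday happens to be last; a slightly more robust sketch derives the highlight from the data itself via idxmax (toy averages here, not the real figures):

```python
import pandas as pd

# Toy daily averages standing in for day_average
avg = pd.Series({"Mon": 650.0, "Tue": 700.0, "Wed": 620.0})

# Build the color list from the data: highlight whichever bar is largest
colors = ["orange" if day == avg.idxmax() else "lightblue"
          for day in avg.index]
print(colors)  # → ['lightblue', 'orange', 'lightblue']
```

With the real day_average series, plt.bar(day_average.index, day_average.values, color=colors) would keep highlighting the busiest day even if the data changed.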
Let us use a boxplot to visualize the relationship between the various categories of genders in our Dataset and the time they spent riding
# Create boxplot of 'member_gender' vs 'duration_sec'
sns.boxplot(x= "member_gender",y="duration_sec",data=df,order = ["Male","Female","Other"],
color = sns.color_palette()[0]);
We can clearly see the outliers we observed during the univariate exploration of duration_sec.
Let us create another boxplot using df_mask, the DataFrame we created by trimming the duration_sec variable.
sns.boxplot(x= "member_gender",y="duration_sec",data=df_mask,order = ["Male","Female","Other"],
color = sns.color_palette()[0]);
We can see that the median ride duration for Female riders is higher than that of their Male and Other counterparts.
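A boxplot's median line can be checked numerically with a groupby; the same call on df_mask would confirm whether the Female median really is highest. A sketch on toy values:

```python
import pandas as pd

# Toy data mimicking member_gender vs duration_sec
toy = pd.DataFrame({
    "member_gender": ["Male", "Male", "Female",
                      "Female", "Other", "Other"],
    "duration_sec": [300, 500, 600, 800, 250, 350],
})

# The per-group median is exactly the line the boxplot draws
medians = toy.groupby("member_gender")["duration_sec"].median()
print(medians)
```

On the real data, df_mask.groupby("member_gender")["duration_sec"].median() gives the three medians directly.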
Finally, let us check the correlation between the numeric variables in our dataset.
plt.subplots(figsize=(10,6))
# numeric_only restricts the correlation to numeric columns (required in newer pandas)
corr = df.corr(numeric_only=True)
sns.heatmap(corr,cmap="Blues");
As expected, we see high correlations between the location variables ('start_station_latitude' with 'end_station_latitude', and 'start_station_longitude' with 'end_station_longitude'), along with moderate to low correlations among several of the other variables in our dataset.
Using Plotly, let us visualize the locations of our bike stations on a map of the San Francisco Bay Area.
# Create a scatter_mapbox of bike stations
fig = px.scatter_mapbox(
df,
lat = 'start_station_latitude',
lon = 'start_station_longitude',
width = 600,
height = 600,
center = {"lat":37.804562,
"lon":-122.24878},
title = "Map of San Francisco Bay showing the Various Bike Stations",
hover_data = ["start_station_name"]
)
fig.update_layout(mapbox_style = "carto-positron")
fig.show(renderer='notebook')
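The map above plots one marker per trip, and since many trips share a station, most markers overlap. Deduplicating by station id before plotting draws each station once and keeps the figure light. A sketch on toy coordinates (not the real stations):

```python
import pandas as pd

# Toy trips: several rows per station, as in the real data
trips = pd.DataFrame({
    "start_station_id": [1, 1, 2, 2, 2, 3],
    "start_station_latitude": [37.80, 37.80, 37.79,
                               37.79, 37.79, 37.78],
    "start_station_longitude": [-122.25, -122.25, -122.26,
                                -122.26, -122.26, -122.27],
})

# One row per station instead of one per trip
stations = trips.drop_duplicates(subset="start_station_id")
print(len(stations))  # → 3
```

Passing the deduplicated frame to px.scatter_mapbox would render 329 markers instead of roughly 175,000.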